Algorithms for Binary Neural Networks
FIGURE 3.23
Kernel weight distribution of the first binarized convolutional layer of BONNs. Before training, the kernels are initialized from a single-mode Gaussian distribution. From the 2nd epoch to the 200th epoch, with λ fixed to 1e-4, the kernel weight distribution becomes increasingly compact and develops two modes, confirming that the Bayesian kernel loss regularizes the kernels into a distribution well suited to binarization.
two-mode GMM style. Figure 3.25 shows the evolution of the binarized values during the training of XNOR-Net and BONN. The two distinct patterns indicate that the binarized values learned by BONN are more diverse.
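The bimodal pull described above can be sketched with a simplified regularizer. This is only an illustrative stand-in for the Bayesian kernel loss: the function name `two_mode_kernel_reg` and the fixed mode center `mu` are assumptions for the sketch, whereas the actual BONN objective also learns the mode means and variances of the two-mode GMM. The weight λ is fixed to 1e-4 as in the experiment.

```python
import numpy as np

def two_mode_kernel_reg(w, mu=1.0, lam=1e-4):
    # Quadratic pull toward the nearer of the two modes +mu / -mu.
    # Minimizing this drives a single-mode Gaussian initialization
    # toward a compact two-mode, binarization-friendly distribution.
    d = np.minimum((w - mu) ** 2, (w + mu) ** 2)
    return lam * d.mean()

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.5, size=10_000)  # single-mode Gaussian init
loss_init = two_mode_kernel_reg(w)           # positive: mass far from +/-mu
loss_binary = two_mode_kernel_reg(np.sign(w))  # zero: weights sit at the modes
```

Because the penalty vanishes exactly when every weight reaches one of the two modes, gradient descent on it gradually sharpens the initial single-mode distribution into the two-mode shape shown in Figure 3.23.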
Effectiveness of Bayesian Feature Loss on Real-Valued Models: We apply our Bayesian feature loss to real-valued models, including ResNet-18 and ResNet-50 [84]. We retrain these two backbones with the Bayesian feature loss for 70 epochs, setting the hyperparameter θ to 1e-3. The SGD optimizer has an initial learning rate of 0.1. We use
FIGURE 3.24
Weight distributions of XNOR and BONN based on WRN-22 (2nd, 8th, and 14th convolutional layers) after 200 epochs. The difference between the XNOR and BONN weight distributions indicates that the proposed Bayesian kernel loss regularizes the kernels across the convolutional layers.
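The retraining setup for the real-valued backbones can be sketched as follows. This is a minimal illustration, not the BONN implementation: `feature_center_loss` is a simplified center-style surrogate (the actual Bayesian feature loss additionally models feature variances), and the toy features, labels, and class centers are invented for the demo. Only θ = 1e-3 and the initial SGD learning rate of 0.1 come from the text.

```python
import numpy as np

def feature_center_loss(feats, labels, centers):
    # Simplified stand-in for the Bayesian feature loss: penalize the
    # squared distance of each feature to its class center, encouraging
    # compact intra-class feature distributions.
    return ((feats - centers[labels]) ** 2).sum(axis=1).mean()

theta, lr = 1e-3, 0.1          # loss weight and initial SGD learning rate
rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 4))              # toy feature batch
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # toy class labels
centers = np.zeros((2, 4))                   # toy class centers

loss_before = feature_center_loss(feats, labels, centers)
for _ in range(200):
    # Plain SGD on the weighted feature loss alone, for illustration;
    # in practice this term is added to the task (cross-entropy) loss.
    grad = 2.0 * (feats - centers[labels]) / len(feats)
    feats -= lr * theta * grad
loss_after = feature_center_loss(feats, labels, centers)
```

Each step shrinks the feature-to-center distances, so `loss_after` is strictly smaller than `loss_before`; in the full training recipe this term is scaled by θ and combined with the classification objective.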